Détection et correction automatique d'entités nommées dans des corpus OCRisés
Internal identifier: 000067 (Main/Exploration); previous: 000066; next: 000068
Authors: Benoît Sagot [France]; Kata Gábor [France]
Abstract
Correcting textual data obtained by optical character recognition (OCR) to reach editorial quality is an expensive task, as it still requires human intervention. The coverage of statistical models for automated error detection and correction is inherently limited to errors that pertain to general language. However, a large number of errors reside in domain-specific named entities, especially when dealing with data such as patent corpora or legal texts. In this paper, we propose a rule-based architecture for the identification and correction of a wide range of named entities (proper names not included). We show that our architecture achieves good recall and excellent correction accuracy on error types that are difficult to address with statistical approaches.
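As a rough illustration of what a rule-based correction of a non-proper-name entity can look like, the sketch below handles one entity type only: dates written DD/MM/YYYY. It is not the architecture described in the paper. The confusion table, the date pattern and the names `OCR_DIGIT_CONFUSIONS`, `NOISY_DATE`, `to_digits` and `correct_dates` are all assumptions made for this example.

```python
import re

# Hypothetical confusion table: characters that OCR may produce in place
# of digits. Illustrative only, not taken from the paper.
OCR_DIGIT_CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# A "noisy digit" is a real digit or one of the look-alike characters above.
NOISY_DIGIT = "[0-9" + "".join(OCR_DIGIT_CONFUSIONS) + "]"

# Detection rule: DD/MM/YYYY, tolerating look-alike characters in digit
# slots, and not glued to a surrounding alphanumeric token.
NOISY_DATE = re.compile(
    rf"(?<![A-Za-z0-9])({NOISY_DIGIT}{{2}})/({NOISY_DIGIT}{{2}})/({NOISY_DIGIT}{{4}})(?![A-Za-z0-9])"
)


def to_digits(chunk: str) -> str:
    """Map every look-alike character in `chunk` to the digit it resembles."""
    return "".join(OCR_DIGIT_CONFUSIONS.get(ch, ch) for ch in chunk)


def correct_dates(text: str) -> str:
    """Rewrite noisy dates in `text`; leave implausible candidates untouched."""

    def fix(match: re.Match) -> str:
        day, month, year = (to_digits(group) for group in match.groups())
        # Correction rule: only rewrite when day and month are plausible.
        if 1 <= int(day) <= 31 and 1 <= int(month) <= 12:
            return f"{day}/{month}/{year}"
        return match.group(0)

    return NOISY_DATE.sub(fix, text)
```

With these assumed rules, `correct_dates("filed on O1/l2/2OI4 in Paris")` returns `"filed on 01/12/2014 in Paris"`. Detection and correction are kept as separate steps so that a candidate failing the plausibility check is left as it was rather than rewritten.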
Links toward previous steps (curation, corpus...)
- to stream Hal, to step Corpus: 000145
- to stream Hal, to step Curation: 000145
- to stream Hal, to step Checkpoint: 000027
- to stream Main, to step Merge: 000067
- to stream Main, to step Curation: 000067
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="fr">Détection et correction automatique d'entités nommées dans des corpus OCRisés</title>
<author><name sortKey="Sagot, Benoit" sort="Sagot, Benoit" uniqKey="Sagot B" first="Benoît" last="Sagot">Benoît Sagot</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD"><idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc><address><addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation><relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-86790" type="direct"><org type="laboratory" xml:id="struct-86790" status="VALID"><idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc><address><addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct"><org type="institution" xml:id="struct-300301" status="VALID"><orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc><address><addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Gabor, Kata" sort="Gabor, Kata" uniqKey="Gabor K" first="Kata" last="Gábor">Kata Gábor</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD"><idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc><address><addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation><relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-86790" type="direct"><org type="laboratory" xml:id="struct-86790" status="VALID"><idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc><address><addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct"><org type="institution" xml:id="struct-300301" status="VALID"><orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc><address><addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01022378</idno>
<idno type="halId">hal-01022378</idno>
<idno type="halUri">https://hal.inria.fr/hal-01022378</idno>
<idno type="url">https://hal.inria.fr/hal-01022378</idno>
<date when="2014-07-01">2014-07-01</date>
<idno type="wicri:Area/Hal/Corpus">000145</idno>
<idno type="wicri:Area/Hal/Curation">000145</idno>
<idno type="wicri:Area/Hal/Checkpoint">000027</idno>
<idno type="wicri:Area/Main/Merge">000067</idno>
<idno type="wicri:Area/Main/Curation">000067</idno>
<idno type="wicri:Area/Main/Exploration">000067</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="fr">Détection et correction automatique d'entités nommées dans des corpus OCRisés</title>
<author><name sortKey="Sagot, Benoit" sort="Sagot, Benoit" uniqKey="Sagot B" first="Benoît" last="Sagot">Benoît Sagot</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD"><idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc><address><addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation><relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-86790" type="direct"><org type="laboratory" xml:id="struct-86790" status="VALID"><idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc><address><addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct"><org type="institution" xml:id="struct-300301" status="VALID"><orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc><address><addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Gabor, Kata" sort="Gabor, Kata" uniqKey="Gabor K" first="Kata" last="Gábor">Kata Gábor</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD"><idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc><address><addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation><relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-86790" type="direct"><org type="laboratory" xml:id="struct-86790" status="VALID"><idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc><address><addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct"><org type="institution" xml:id="struct-300301" status="VALID"><orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc><address><addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Correcting textual data obtained by optical character recognition (OCR) to reach editorial quality is an expensive task, as it still requires human intervention. The coverage of statistical models for automated error detection and correction is inherently limited to errors that pertain to general language. However, a large number of errors reside in domain-specific named entities, especially when dealing with data such as patent corpora or legal texts. In this paper, we propose a rule-based architecture for the identification and correction of a wide range of named entities (proper names not included). We show that our architecture achieves good recall and excellent correction accuracy on error types that are difficult to address with statistical approaches.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
</list>
<tree><country name="France"><noRegion><name sortKey="Sagot, Benoit" sort="Sagot, Benoit" uniqKey="Sagot B" first="Benoît" last="Sagot">Benoît Sagot</name>
</noRegion>
<name sortKey="Gabor, Kata" sort="Gabor, Kata" uniqKey="Gabor K" first="Kata" last="Gábor">Kata Gábor</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000067 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000067 | SxmlIndent | more
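Once a record has been extracted with the commands above, it can also be read programmatically. The sketch below is one possible way to do this in Python, under some assumptions: the record uses the `wicri:` and `hal:` prefixes without declaring them, so the code wraps it in a hypothetical `<wrapper>` element that binds those prefixes to placeholder URIs before parsing. `SAMPLE` is a shortened copy of the record shown above, and `parse_record` is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record above, kept small for the example.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="fr">Détection et correction automatique d'entités nommées dans des corpus OCRisés</title>
<author><name first="Benoît" last="Sagot">Benoît Sagot</name>
<affiliation wicri:level="1"><country>France</country></affiliation></author>
<author><name first="Kata" last="Gábor">Kata Gábor</name>
<affiliation wicri:level="1"><country>France</country></affiliation></author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""


def parse_record(xml_text: str):
    """Return (title, [author names]) from the titleStmt of a record."""
    # Placeholder namespace URIs (assumptions): they only make the
    # undeclared prefixes acceptable to the parser.
    wrapped = (
        '<wrapper xmlns:wicri="urn:example:wicri" xmlns:hal="urn:example:hal">'
        + xml_text
        + "</wrapper>"
    )
    root = ET.fromstring(wrapped)
    stmt = root.find(".//titleStmt")
    title = stmt.findtext("title")
    authors = [name.text for name in stmt.findall("author/name")]
    return title, authors


if __name__ == "__main__":
    title, authors = parse_record(SAMPLE)
    print(title)
    print("; ".join(authors))
```

Reading only `titleStmt` avoids counting each author twice, since the same names are repeated in the `sourceDesc` part of the record.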
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Hal:hal-01022378 |texte= Détection et correction automatique d'entités nommées dans des corpus OCRisés }}
This area was generated with Dilib version V0.6.32.